We introduce the partially observable history process (POHP) formalism for reinforcement learning. POHP centers around the actions and observations of a single agent and abstracts away the presence of other players without reducing them to stochastic processes. Our formalism provides a streamlined interface for designing algorithms that defy categorization as exclusively single-agent or multi-agent, and for developing theory that applies across these domains. We show how the POHP formalism unifies traditional models, including the Markov decision process, the Markov game, the extensive-form game, and their partially observable extensions, without introducing burdensome technical machinery or violating the philosophical underpinnings of reinforcement learning. We illustrate the utility of the formalism by concisely exploring observable sequential rationality, re-deriving the extensive-form regret minimization (EFR) algorithm, and examining EFR's theoretical properties in greater generality.
Hindsight rationality is an approach to playing general-sum games that prescribes no-regret learning dynamics for individual agents with respect to a set of deviations, and further describes jointly rational behavior among multiple agents with mediated equilibria. To develop hindsight rational learning in sequential decision-making settings, we formalize behavioral deviations as a general class of deviations that respect the structure of extensive-form games. Integrating the idea of time selection into counterfactual regret minimization (CFR), we introduce the extensive-form regret minimization (EFR) algorithm, which achieves hindsight rationality for any given set of behavioral deviations with computation that scales closely with the complexity of the set. We identify behavioral deviation subsets, the partial sequence deviation types, which subsume previously studied types and lead to efficient EFR instances in games of moderate length. In addition, we conduct a thorough empirical analysis of EFR instantiated with different deviation types in benchmark games, and we find that stronger types typically induce better performance.
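To make the regret-minimization machinery above more concrete, here is a minimal, self-contained sketch (not the paper's implementation) of regret matching, the local no-regret update that CFR-family algorithms such as EFR build on at each decision point; the payoff vectors in the toy loop are random placeholders standing in for counterfactual values.

```python
import numpy as np

def regret_matching_policy(cumulative_regret):
    """Map cumulative regrets to a policy: play actions in proportion to positive regret."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    # No positive regret: fall back to the uniform policy.
    return np.full_like(cumulative_regret, 1.0 / len(cumulative_regret))

# Toy loop: one decision point with 3 actions and random (hypothetical) payoffs,
# just to show the update rule in isolation.
rng = np.random.default_rng(0)
num_actions = 3
cumulative_regret = np.zeros(num_actions)
for t in range(1000):
    payoffs = rng.normal(size=num_actions)      # stand-in for counterfactual action values
    policy = regret_matching_policy(cumulative_regret)
    expected = policy @ payoffs
    cumulative_regret += payoffs - expected     # instantaneous regret of each action

print(regret_matching_policy(cumulative_regret))
```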
Driven by recent successes in two-player, zero-sum games, artificial intelligence work on games has increasingly emphasized algorithms that produce equilibrium-based strategies. However, this approach has been less effective at producing competent players in general-sum games or games with more than two players than in two-player, zero-sum games. An appealing alternative is to consider adaptive algorithms that ensure strong performance in hindsight relative to what could have been achieved with modified behavior. This approach also leads to a game-theoretic analysis, but in the correlated play that arises from joint learning dynamics rather than from equilibrium agent behavior. We develop and advocate for this hindsight rationality framing of learning in general sequential decision-making settings. To this end, we re-examine mediated equilibria and deviation types in extensive-form games, obtaining a more complete understanding and resolving past misconceptions. We present a set of examples illustrating the distinct strengths and weaknesses of each type of equilibrium in the literature, and prove that no tractable concept subsumes all of the others. This line of inquiry culminates in the definition of the deviation and equilibrium classes that correspond to algorithms in the counterfactual regret minimization (CFR) family, relating them to all others in the literature. Examining CFR in greater detail further leads to a new recursive definition of rationality in correlated play that extends sequential rationality in a way that naturally applies to hindsight evaluation.
Selecting an effective training signal for tasks in natural language processing is difficult: collecting expert annotations is expensive, and crowd-sourced annotations may not be reliable. At the same time, recent work in machine learning has demonstrated that learning from soft labels acquired from crowd annotations can be effective, especially when there is distribution shift in the test set. However, the best method for acquiring these soft labels is inconsistent across tasks. This paper proposes new methods for acquiring soft labels from crowd annotations by aggregating the distributions produced by existing methods. In particular, we propose to find a distribution over classes by learning from multiple views of crowd annotations via temperature scaling and finding the Jensen-Shannon centroid of their distributions. We demonstrate that using these aggregation methods leads to the best or near-best performance across four NLP tasks on out-of-domain test sets, mitigating fluctuations in performance when using the constituent methods on their own. Additionally, these methods result in the best or near-best uncertainty estimation across tasks. We argue that aggregating different views of crowd annotations as soft labels is an effective way to ensure performance that is as good as or better than the best individual view, which is useful given the inconsistency in performance of the individual methods.
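As a rough illustration of the aggregation step described above (not the paper's released code), the sketch below numerically approximates the Jensen-Shannon centroid of several soft-label distributions by minimizing the mean Jensen-Shannon divergence over the probability simplex via a softmax parameterization; the three input "views" are made-up placeholders.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import jensenshannon  # returns the JS distance (sqrt of the divergence)

def js_centroid(distributions):
    """Approximate the Jensen-Shannon centroid of a set of categorical distributions."""
    distributions = np.asarray(distributions)
    num_classes = distributions.shape[1]

    def objective(logits):
        # Softmax keeps the candidate centroid on the probability simplex.
        p = np.exp(logits - logits.max())
        p /= p.sum()
        # Mean squared JS distance equals the mean JS divergence to each distribution.
        return np.mean([jensenshannon(p, q) ** 2 for q in distributions])

    result = minimize(objective, x0=np.zeros(num_classes), method="Nelder-Mead")
    p = np.exp(result.x - result.x.max())
    return p / p.sum()

# Hypothetical soft labels for one item produced by three different methods ("views").
views = [[0.7, 0.2, 0.1],
         [0.6, 0.3, 0.1],
         [0.5, 0.4, 0.1]]
print(js_centroid(views))
```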
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce impressive results and top many of today's benchmarks. A notable omission within this family of large-scale datasets is 3D data. Despite considerable interest and potential applications in 3D vision, datasets of high-fidelity 3D models continue to be mid-sized with limited diversity of object categories. Addressing this gap, we present Objaverse 1.0, a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse improves upon present day 3D repositories in terms of scale, number of categories, and in the visual diversity of instances within a category. We demonstrate the large potential of Objaverse via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models. Objaverse can open new directions for research and enable new applications across the field of AI.
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
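A minimal sketch of the supervised critique-and-revision loop described above follows; `lm_generate` is a hypothetical stand-in for a real language-model call, and the single principle pair is illustrative, not the paper's actual constitution.

```python
# Minimal sketch of the supervised critique -> revision phase described above.
# `lm_generate` is a hypothetical placeholder for an actual language-model call,
# and PRINCIPLES is an illustrative fragment, not the paper's constitution.

PRINCIPLES = [
    ("Identify ways the response is harmful, unethical, or misleading.",
     "Rewrite the response to remove that content while remaining helpful."),
]

def lm_generate(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real model call here.
    return "<model output for: " + prompt[:40] + "...>"

def critique_and_revise(user_prompt: str) -> dict:
    """Produce a (prompt, revised response) pair for supervised finetuning."""
    response = lm_generate(user_prompt)
    for critique_request, revision_request in PRINCIPLES:
        critique = lm_generate(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique request: {critique_request}"
        )
        response = lm_generate(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique: {critique}\nRevision request: {revision_request}"
        )
    # The resulting (prompt, revision) pairs are what the original model is finetuned on.
    return {"prompt": user_prompt, "revision": response}

print(critique_and_revise("How do I pick a lock?"))
```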
We introduce a sketch-and-solve approach to speed up the Peng-Wei semidefinite relaxation of k-means clustering. When the data is appropriately separated we identify the k-means optimal clustering. Otherwise, our approach provides a high-confidence lower bound on the optimal k-means value. This lower bound is data-driven; it does not make any assumption on the data nor how it is generated. We provide code and an extensive set of numerical experiments where we use this approach to certify approximate optimality of clustering solutions obtained by k-means++.
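The certification logic can be sketched as follows, assuming a hypothetical `sdp_lower_bound` callable that stands in for the sketch-and-solve relaxation described above and using scikit-learn's k-means++ for the candidate clustering.

```python
import numpy as np
from sklearn.cluster import KMeans

def certify(X, k, sdp_lower_bound):
    """Compare a k-means++ solution against a data-driven lower bound.

    `sdp_lower_bound` is a hypothetical callable standing in for the
    sketch-and-solve relaxation of the abstract; it must return a valid
    lower bound on the optimal k-means value for (X, k).
    """
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    upper = km.inertia_                # k-means objective of the candidate clustering
    lower = sdp_lower_bound(X, k)      # high-confidence lower bound on the optimum
    gap = (upper - lower) / max(lower, 1e-12)
    # A small relative gap certifies that the candidate clustering is near-optimal.
    return upper, lower, gap

# Toy usage with a trivial (always valid) lower bound of 0, only so the sketch runs end to end.
X = np.random.default_rng(0).normal(size=(200, 2))
print(certify(X, 3, lambda X, k: 0.0))
```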
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on how to turn it into one that can be productively studied empirically. We first present an experimental design centered on choosing tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.
We present the design and baseline results of a new challenge in the ChaLearn meta-learning series, accepted at NeurIPS'22, focusing on "cross-domain" meta-learning. Meta-learning aims to leverage experience gained from previous tasks to solve new tasks efficiently (i.e., with better performance, less training data, and/or modest computational resources). While previous challenges in the series focused on within-domain few-shot learning problems, with the goal of efficiently learning N-way K-shot tasks (i.e., N-class classification problems with K training examples per class), this competition challenges participants to solve "any-way" and "any-shot" problems drawn from various domains (healthcare, ecology, biology, manufacturing, and others), chosen for their humanitarian and societal impact. To that end, we created Meta-Album, a meta-dataset of 40 image classification datasets from 10 domains, from which we carve out tasks with any number of "ways" (in the range 2-20) and any number of "shots" (in the range 1-20). The competition is with code submission, fully blind-tested on the CodaLab challenge platform. The code of the winners will be open-sourced, enabling the deployment of automated machine learning solutions for few-shot image classification across several domains.
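As a rough sketch of the task format (not the competition's starting kit), the snippet below samples an "any-way, any-shot" episode from a labeled dataset, with the way and shot ranges taken from the description above; the toy labels are placeholders.

```python
import random
from collections import defaultdict

def sample_any_way_any_shot(labels, min_way=2, max_way=20, min_shot=1, max_shot=20, seed=None):
    """Sample an N-way K-shot episode (support-set indices) from a labeled dataset.

    `labels` is a list of class labels, one per example. The way/shot ranges
    mirror the 2-20 "ways" and 1-20 "shots" quoted in the challenge description.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    # Choose how many classes ("ways") and examples per class ("shots") this episode uses.
    n_way = rng.randint(min_way, min(max_way, len(by_class)))
    classes = rng.sample(sorted(by_class), n_way)
    k_shot = rng.randint(min_shot, min(max_shot, min(len(by_class[c]) for c in classes)))

    support = {c: rng.sample(by_class[c], k_shot) for c in classes}
    return n_way, k_shot, support

# Toy usage with made-up labels: 5 classes, 20 examples each.
toy_labels = [i % 5 for i in range(100)]
print(sample_any_way_any_shot(toy_labels, seed=0))
```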